feat: support pretrained_to_huggingface functionality for CosyVoice3 RL training by Sakkana · Pull Request #1890 · FunAudioLLM/CosyVoice

Sakkana · 2026-05-18T17:14:07Z

Support pretrained_to_huggingface functionality for CosyVoice3 RL training

Summary

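Add support for converting the pretrained PyTorch CosyVoice3 model into a Hugging Face model for RL training. The sections below list how the conversion differs from the existing CosyVoice2 (CV2) path.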

1. Token Design

Item	CosyVoice2	CosyVoice3
Base speech tokens	6561	6561
Extra control tokens	`<|eos1|>` `<|eos2|>` `<|eos3|>` `<|sos|>` `<|task_id|>` (+5)	200 extended slots + `<|sos|>` `<|eos|>` `<|task_id|>` (+203)
total_speech_tokens	6564	6761

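CV3 folds the control tokens into the speech token space and uses an alias map to redirect them to existing slots, so they add no rows beyond the 6561 + 200 = 6761 speech tokens. CV2 simply appends its control tokens after the vocabulary.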
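The layout can be sketched as follows. This is a minimal illustration, not the PR's code: it assumes the totals in the table count 6561 base tokens plus three EOS tokens for CV2 and 6561 plus 200 extended slots for CV3, and the slot positions chosen for `<|sos|>`, `<|eos|>`, and `<|task_id|>` are hypothetical.

```python
# Sketch of the two speech-token layouts. Slot positions are assumptions.
BASE_SPEECH_TOKENS = 6561
CV2_EOS_TOKENS = 3        # <|eos1|>, <|eos2|>, <|eos3|>
CV3_EXTENDED_SLOTS = 200  # extra rows that follow the base speech tokens


def total_speech_tokens(version):
    """Return the speech vocabulary size under the assumed counting."""
    if version == "cv2":
        return BASE_SPEECH_TOKENS + CV2_EOS_TOKENS
    if version == "cv3":
        return BASE_SPEECH_TOKENS + CV3_EXTENDED_SLOTS
    raise ValueError(f"unknown version: {version}")


def build_cv3_alias_map(sos_slot=0, eos_slot=1, task_id_slot=2):
    """Map each control token to a speech-token index inside the extended slots.

    The default slot positions are hypothetical placeholders.
    """
    return {
        "<|sos|>": BASE_SPEECH_TOKENS + sos_slot,
        "<|eos|>": BASE_SPEECH_TOKENS + eos_slot,
        "<|task_id|>": BASE_SPEECH_TOKENS + task_id_slot,
    }


alias_map = build_cv3_alias_map()
# Every alias must resolve to a slot that already exists in the speech space.
assert all(
    BASE_SPEECH_TOKENS <= idx < total_speech_tokens("cv3")
    for idx in alias_map.values()
)
```

Because every alias points at an existing slot, the CV3 speech vocabulary does not grow when the control tokens are registered.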

2. Special Token Vocabulary

CV3 introduces special tokens absent in CV2, most of them phoneme-level:

- English ARPAbet: [AA], [AE], [AH], [B], [CH] ...
- Mandarin pinyin with tones: [ā], [ǎo], [iāng], [uán] ...
- New system control token: <|endofsystem|>

3. lm_head Construction

Item	CosyVoice2	CosyVoice3
Bias	Yes, initialized to `-inf`	No (`bias=False`)
Weight injection	Absolute offset indexing	`slice(speech_start_idx, speech_end_idx)`
Alias token handling	None	Copies weights from source token into alias token rows
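The CV3 column can be illustrated with a toy example. This is only a sketch under stated assumptions: nested Python lists stand in for the weight tensor, the function name and `alias_to_source` argument are hypothetical, and rows outside the speech slice are left at zero because the table does not say how they are initialized.

```python
def build_lm_head(vocab_size, hidden_size, speech_weights,
                  speech_start_idx, alias_to_source):
    """Build a bias-free lm_head weight matrix (toy version with lists).

    speech_weights:  rows of the pretrained speech decoder.
    alias_to_source: {alias_row_index: source_row_index}, hypothetical.
    """
    # Rows not covered by the speech slice stay at zero in this sketch.
    head = [[0.0] * hidden_size for _ in range(vocab_size)]

    # Weight injection through a slice instead of absolute offsets.
    speech_end_idx = speech_start_idx + len(speech_weights)
    if speech_end_idx > vocab_size:
        raise ValueError("speech slice exceeds the vocabulary")
    head[speech_start_idx:speech_end_idx] = [list(row) for row in speech_weights]

    # Alias handling: copy the source token's row into the alias token's row.
    for alias_idx, source_idx in alias_to_source.items():
        head[alias_idx] = list(head[source_idx])
    return head


# Toy usage: 10-token vocabulary, 4 speech rows starting at index 3,
# and one alias row (8) that mirrors speech row 4.
toy_speech = [[1.0, 1.5], [2.0, 2.5], [3.0, 3.5], [4.0, 4.5]]
toy_head = build_lm_head(10, 2, toy_speech, 3, {8: 4})
```

After the copy, the alias row and its source row produce identical logits, so sampling either ID is equivalent from the model's point of view.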

4. Input Embeddings

CV2 explicitly copies llm_embedding weights for <|sos|> and <|task_id|> into the input embedding table. CV3 drops llm_embedding entirely and handles everything through the alias mechanism.

5. EOS Token Configuration

CV2 registers three separate EOS token IDs:

eos_token_ids = [offset+6561, offset+6562, offset+6563]

CV3 uses both alias and real IDs as a dual fallback:

llm.generation_config.eos_token_id = [alias_eos_token_id, real_eos_token_id]
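The effect of registering several EOS IDs can be sketched in plain Python. The helper names and the example `offset` value are hypothetical; the only thing assumed about Hugging Face is that `generation_config.eos_token_id` accepts a list of IDs, in which case generation stops at whichever one appears first.

```python
def cv2_eos_token_ids(offset, base_speech_tokens=6561):
    """Three consecutive EOS IDs placed right after the base speech tokens."""
    return [offset + base_speech_tokens + i for i in range(3)]


def cv3_eos_token_ids(alias_eos_token_id, real_eos_token_id):
    """Alias ID first, real ID as fallback; duplicates are dropped."""
    ids = []
    for token_id in (alias_eos_token_id, real_eos_token_id):
        if token_id not in ids:
            ids.append(token_id)
    return ids


def truncate_at_eos(token_ids, eos_token_ids):
    """Return the tokens generated before the first EOS of any kind."""
    eos = set(eos_token_ids)
    for position, token_id in enumerate(token_ids):
        if token_id in eos:
            return list(token_ids[:position])
    return list(token_ids)


# Hypothetical offset of 1000, chosen only to keep the numbers readable.
cv2_ids = cv2_eos_token_ids(offset=1000)
cv2_trimmed = truncate_at_eos([5, 9, 7562, 3], cv2_ids)

# Hypothetical CV3 IDs: the sequence ends on whichever EOS ID shows up.
cv3_ids = cv3_eos_token_ids(alias_eos_token_id=9001, real_eos_token_id=7562)
cv3_trimmed = truncate_at_eos([5, 9, 9001, 3], cv3_ids)
```

Listing both IDs for CV3 means a rollout terminates whether the model emits the alias token or the underlying speech-space token.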

Test

Setup: 8 x B200 GPUs + Triton reward server (SenseVoice) + Verl + GRPO adv_estimator

- gt_text: Nathy is still leading by fifteen thousand! We need one gift to unlock our bonus mission. Who is saving us?
- hyp_text: nay is still leading by fifteen thousand we need one gift to unlock our bonus mission who is saving us
- reward_val: 0.851114966376682

Note:

The WER calculation and reward function are currently custom implementations. For ASR, the original SenseVoice implementation from the repository is used for demonstration purposes only, without text frontend normalization. Using Whisper as the ASR model can achieve better performance.
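As a sanity check on the sample above, a word-level WER and a tanh-shaped reward can be sketched as follows. Both the normalization (lowercasing and stripping punctuation) and the reward form `1 - tanh(3 * WER)` are assumptions inferred from that single sample, not taken from the PR's code. With one substituted word out of 20, the WER is 0.05 and the reward comes out at about 0.8511, which agrees with the logged reward_val.

```python
import math
import re


def normalize(text):
    """Lowercase, drop punctuation, split on whitespace (assumed normalization)."""
    return re.sub(r"[^\w\s]", "", text.lower()).split()


def word_edit_distance(ref, hyp):
    """Levenshtein distance over word lists, computed one row at a time."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1]


def wer(gt_text, hyp_text):
    """Word error rate of hyp_text against gt_text."""
    ref, hyp = normalize(gt_text), normalize(hyp_text)
    if not ref:
        return 0.0 if not hyp else 1.0
    return word_edit_distance(ref, hyp) / len(ref)


def reward_from_wer(error_rate, scale=3.0):
    """Assumed reward shape: 1 at WER 0, decaying smoothly toward 0."""
    return 1.0 - math.tanh(scale * error_rate)


gt_text = ("Nathy is still leading by fifteen thousand! We need one gift "
           "to unlock our bonus mission. Who is saving us?")
hyp_text = ("nay is still leading by fifteen thousand we need one gift "
            "to unlock our bonus mission who is saving us")

sample_wer = wer(gt_text, hyp_text)
sample_reward = reward_from_wer(sample_wer)
```

The tanh shape keeps the reward bounded in (0, 1] and penalizes the first few errors most steeply, which is one plausible reason to prefer it over a linear `1 - WER`.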

support pretrained_to_huggingface functionality for CosyVoice3 RL tra…

b950ffe

Sakkana wants to merge 1 commit into FunAudioLLM:main from Sakkana:main
